The training of deep neural networks and other modern machine learning models usually involves solving high-dimensional, non-convex optimization problems constrained by large-scale data. Here, momentum-based stochastic optimization algorithms have become particularly popular in recent years. The stochasticity arises from data subsampling, which reduces the computational cost. Moreover, both momentum and stochasticity are supposed to help the algorithm overcome local minimizers and, hopefully, converge globally. Theoretically, this combination of stochasticity and momentum is poorly understood. In this work, we propose and analyze a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the particle motion by an underdamped dynamical system and the data subsampling by a stochastic switching of the dynamical system. In our analysis, we investigate longtime limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time: intuitively, momentum helps to overcome local minimizers in the initial phase of the algorithm, but prohibits fast convergence to a global minimizer later on. Under convexity assumptions, we show convergence of the dynamical system to the global minimizer when reducing the momentum over time and letting the subsampling rate go to infinity. We then propose a stable, symplectic discretization scheme to construct algorithms from our continuous-time dynamical system. In numerical experiments, we study our discretization scheme on convex and non-convex test problems. Additionally, we train a convolutional neural network to solve the CIFAR-10 image classification problem. Here, our algorithm achieves results competitive with stochastic gradient descent with momentum.
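A minimal sketch of the kind of scheme the abstract describes, on a toy finite-sum objective: a symplectic (semi-implicit Euler) discretization of underdamped dynamics, with the data index switched at random times and the friction increased over time so that momentum decays. The objective, step size, friction schedule, and switching rate are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum objective: f(x) = (1/N) * sum_i 0.5 * (x - a_i)^2,
# whose global minimizer is the mean of the a_i.
N = 100
a = rng.normal(size=N)

def grad_fi(x, i):
    """Gradient of the i-th indexed target function."""
    return x - a[i]

def momentum_sgd_pdmp(x0, v0, h=1e-2, T=50.0, gamma0=0.1, switch_rate=10.0):
    """Semi-implicit (symplectic) Euler discretization of underdamped
    dynamics with random index switching; the friction gamma grows with
    time, i.e. the momentum is reduced over time."""
    x, v, t = x0, v0, 0.0
    i = rng.integers(N)                          # current data index
    next_switch = rng.exponential(1.0 / switch_rate)
    while t < T:
        gamma = gamma0 * (1.0 + t)               # increasing friction
        v = v - h * (gamma * v + grad_fi(x, i))  # kick
        x = x + h * v                            # drift (uses updated v)
        t += h
        if t >= next_switch:                     # Poisson clock: switch index
            i = rng.integers(N)
            next_switch += rng.exponential(1.0 / switch_rate)
    return x

x_final = momentum_sgd_pdmp(x0=5.0, v0=0.0)
print(x_final, a.mean())  # x_final should be close to the global minimizer
```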
Practical image segmentation tasks concern images that must be reconstructed from noisy, distorted, and/or incomplete observations. A recent approach for solving such tasks is to perform this reconstruction jointly with the segmentation, using each to guide the other. However, this work has so far employed relatively simple segmentation methods, such as the Chan-Vese algorithm. In this paper, we present a method for joint reconstruction-segmentation using graph-based segmentation methods, which have been seeing increased interest recently. Complications arise due to the large size of the matrices involved, and we show how these complications can be managed. We then analyze the convergence properties of our scheme. Finally, we apply this scheme to distorted versions of the "two cows" image familiar from previous graph-based segmentation literature, first to a highly noised version and second to a blurred version, achieving highly accurate segmentations in both cases. We compare these results to those obtained by sequential reconstruction-segmentation approaches, finding that our method competes with, and even outperforms, those approaches in terms of both reconstruction and segmentation accuracy.
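A loose one-dimensional analogue of the joint scheme, assuming a two-phase spectral segmentation (thresholded Fiedler vector of an intensity-weighted path-graph Laplacian) and a simple proximal reconstruction step; the coupling, weights, and parameters are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def fiedler_labels(u, sigma=0.5):
    """Two-phase graph segmentation: build a path-graph Laplacian with
    intensity-based edge weights and threshold its Fiedler vector."""
    n = len(u)
    w = np.exp(-((u[1:] - u[:-1]) ** 2) / sigma**2)  # edge weights
    L = np.zeros((n, n))
    idx = np.arange(n - 1)
    L[idx, idx + 1] = L[idx + 1, idx] = -w
    L -= np.diag(L.sum(axis=1))                       # degrees on diagonal
    vals, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)               # sign of Fiedler vector

def joint_reconstruct_segment(y, lam=2.0, n_iter=10):
    """Alternate reconstruction and graph-based segmentation, letting each
    guide the other."""
    u = y.copy()
    for _ in range(n_iter):
        labels = fiedler_labels(u)
        # Piecewise-constant prior: per-segment means of the current estimate.
        prior = np.array([u[labels == c].mean() for c in labels])
        # Reconstruction: proximal blend of data fidelity and the prior.
        u = (y + lam * prior) / (1.0 + lam)
    return u, labels

# Noisy piecewise-constant signal as a tiny stand-in for the "two cows" image.
rng = np.random.default_rng(1)
truth = np.r_[np.zeros(50), np.ones(50)]
u, labels = joint_reconstruct_segment(truth + 0.3 * rng.normal(size=100))
```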
Optimization problems with continuous data appear in, e.g., robust machine learning, functional data analysis, and variational inference. Here, the target function is given as an integral over a family of (continuously) indexed target functions, integrated with respect to a probability measure. Such problems can often be solved by stochastic optimization methods: performing optimization steps with respect to the indexed target functions while randomly switching the index. In this work, we study a continuous-time variant of the stochastic gradient descent algorithm for optimization problems with continuous data. This so-called stochastic gradient process consists of a gradient flow minimizing an indexed target function, coupled with a continuous-time index process that determines the index. Index processes are, e.g., reflected diffusions, pure jump processes, or other Lévy processes on compact spaces. We thus study multiple sampling patterns for the continuous data space and allow for data to be simulated or streamed at runtime of the algorithm. We analyze the approximation properties of the stochastic gradient process and study its longtime behavior and ergodicity under constant learning rates. We conclude by illustrating the applicability of the stochastic gradient process in a polynomial regression problem with noisy functional data, as well as in physics-informed neural networks.
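A small sketch of such a stochastic gradient process for the polynomial regression example, assuming a reflected Brownian motion on [0,1] as the index process and an Euler discretization of the gradient flow; the target, noise model, and time-scale parameter eps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_data(s):
    """Noisy functional data: observe y(s) = sin(2*pi*s) + noise at index s."""
    return np.sin(2 * np.pi * s) + 0.1 * rng.normal()

def grad_indexed(theta, s):
    """Gradient of the indexed target f_s(theta) = 0.5*(p_theta(s) - y(s))^2,
    where p_theta is a degree-3 polynomial with coefficients theta."""
    phi = np.array([1.0, s, s**2, s**3])   # polynomial features at index s
    return (phi @ theta - f_data(s)) * phi

def stochastic_gradient_process(h=1e-3, T=50.0, eps=0.05):
    """Euler discretization of the coupled system: a gradient flow in theta
    driven by an index process (here, reflected Brownian motion on [0,1])."""
    theta, s, t = np.zeros(4), 0.5, 0.0
    while t < T:
        # Index process step: Brownian increment, reflected back into [0,1].
        s = s + np.sqrt(h / eps) * rng.normal()
        s = -s if s < 0 else s
        s = 2 - s if s > 1 else s
        # Gradient flow step with respect to the currently indexed target.
        theta = theta - h * grad_indexed(theta, s)
        t += h
    return theta

theta = stochastic_gradient_process()  # rough polynomial fit of sin(2*pi*s)
```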
The analysis of network structure is essential to many scientific areas, ranging from biology to sociology. As the computational task of clustering these networks into partitions, i.e., solving the community detection problem, is generally NP-hard, heuristic solutions are indispensable. The exploration of expedient heuristics has led to the development of particularly promising approaches in the emerging technology of quantum computing. Motivated by the substantial hardware demands of all established quantum community detection approaches, we introduce a novel QUBO-based approach that needs only as many qubits as the graph has nodes and is represented by a QUBO-matrix as sparse as the input graph's adjacency matrix. The substantial improvement in the sparsity of the QUBO-matrix, which is typically very dense in related work, is achieved through the novel concept of separation-nodes. Instead of assigning every node to a community directly, this approach relies on the identification of a separation-node set, which -- upon its removal from the graph -- yields a set of connected components, representing the core components of the communities. A greedy heuristic then assigns the nodes from the separation-node set to the identified community cores, and subsequent experimental results yield a proof of concept. This work hence displays a promising approach to NISQ-ready quantum community detection, catalyzing the application of quantum computers for the network structure analysis of large-scale, real-world problem instances.
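A sketch of the classical post-processing step described above, assuming a separation-node set is already available (in the paper it would come from solving the sparse QUBO); the graph and the chosen separation nodes here are arbitrary illustrations.

```python
import networkx as nx

def communities_from_separation_nodes(G, sep_nodes):
    """Recover communities from a separation-node set: the connected
    components of G minus the separation nodes are the community cores,
    and each separation node is greedily attached to the core with which
    it shares the most edges."""
    core_graph = G.subgraph(n for n in G if n not in sep_nodes)
    cores = [set(c) for c in nx.connected_components(core_graph)]
    for v in sep_nodes:
        best = max(cores, key=lambda c: sum(1 for u in G[v] if u in c))
        best.add(v)                      # greedy assignment to community core
    return cores

# Hypothetical example: karate club graph with an arbitrary separation set.
G = nx.karate_club_graph()
print(communities_from_separation_nodes(G, sep_nodes={0, 2, 33}))
```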
Efficient surrogate modelling is a key requirement for uncertainty quantification in data-driven scenarios. In this work, a novel approach is described that combines Sparse Random Features for surrogate modelling with self-supervised dimensionality reduction. The method is compared to other methods on synthetic and real data obtained from crashworthiness analyses. The results show that the described approach outperforms state-of-the-art surrogate modelling techniques, namely Polynomial Chaos Expansions and Neural Networks.
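A generic stand-in for the surrogate described above, assuming random Fourier features with an l1-regularized (hence sparse) linear fit on top; the feature count, kernel width, and synthetic data are illustrative assumptions, and the self-supervised dimensionality reduction step is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def random_feature_surrogate(X, y, n_features=500, sigma=1.0, alpha=1e-3, seed=0):
    """Fit a sparse random-feature surrogate: random Fourier features
    followed by a Lasso regression that selects few active features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
    model = Lasso(alpha=alpha).fit(Z, y)   # l1 penalty enforces sparsity

    def predict(Xq):
        Zq = np.sqrt(2.0 / n_features) * np.cos(Xq @ W + b)
        return model.predict(Zq)
    return predict

# Synthetic stand-in for crashworthiness data: 5 inputs, scalar response.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
surrogate = random_feature_surrogate(X, y)
print(surrogate(X[:5]))
```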
In the era of noisy intermediate-scale quantum devices, variational quantum circuits (VQCs) are currently one of the main strategies for building quantum machine learning models. These models are made up of a quantum part and a classical part. The quantum part is given by a parametrization $U$, which, in general, is obtained from the product of different quantum gates. In turn, the classical part corresponds to an optimizer that updates the parameters of $U$ in order to minimize a cost function $C$. However, despite the many applications of VQCs, there are still questions to be answered, such as: What is the best sequence of gates to use? How should their parameters be optimized? Which cost function should be used? How does the architecture of the quantum chip influence the final results? In this article, we focus on answering the last question. We will show that, in general, the closer the parametrization used is to a $2$-design, the more the cost function tends to a typical average value, and hence the less the result of the quantum neural network model will depend on its parametrization. As a consequence, we can use the quantum chip's own architecture to define the VQC parametrization, avoiding the use of additional swap gates and thus reducing the VQC depth and the associated errors.
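A numerical sketch of the concentration effect described above, assuming a hardware-efficient ansatz (RY rotations plus CZ gates restricted to nearest-neighbor qubits, i.e., a line-shaped native coupling map that requires no swap gates): as the depth grows and the circuit approaches a $2$-design, the variance of the cost over random parameters shrinks toward a typical average value. Qubit count, cost observable, and depths are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_qubits, dim = 4, 2 ** 4

def kron_all(mats):
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

I2 = np.eye(2)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cz(i, n):
    """CZ on neighboring qubits i, i+1 -- the assumed native coupling."""
    U = np.eye(2 ** n)
    for k in range(2 ** n):
        if (k >> (n - 1 - i)) & 1 and (k >> (n - 2 - i)) & 1:
            U[k, k] = -1
    return U

def ansatz_state(params):
    """Layered hardware-efficient ansatz: RY rotations + nearest-neighbor CZs."""
    psi = np.zeros(dim); psi[0] = 1.0
    for layer in params:                       # params has shape (depth, n_qubits)
        psi = kron_all([ry(t) for t in layer]) @ psi
        for i in range(n_qubits - 1):
            psi = cz(i, n_qubits) @ psi
    return psi

# Cost: expectation of Z on qubit 0; its variance over random parameters
# shrinks as the depth (and closeness to a 2-design) increases.
z0 = kron_all([np.diag([1.0, -1.0])] + [I2] * (n_qubits - 1))
for depth in (1, 4, 16):
    states = [ansatz_state(rng.uniform(0, 2 * np.pi, (depth, n_qubits)))
              for _ in range(200)]
    costs = [float(psi @ z0 @ psi) for psi in states]
    print(depth, np.var(costs))
```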
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
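A minimal sketch of the wall-clock-budgeted setup the abstract describes: masked language modeling trained from scratch under a hard one-day budget. The tiny encoder, random stand-in data, and hyperparameters are illustrative assumptions, not the paper's optimized pipeline.

```python
import time
import torch
import torch.nn as nn

VOCAB, MASK_ID = 30522, 103          # BERT-style vocabulary and [MASK] id
BUDGET_SECONDS = 24 * 3600           # the one-day wall-clock budget

# Tiny stand-in for a BERT-style encoder.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
embed = nn.Embedding(VOCAB, 256)
head = nn.Linear(256, VOCAB)
params = list(model.parameters()) + list(embed.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

def random_batch(bs=32, seq=128):
    """Stand-in for a real pretraining corpus: random token ids."""
    ids = torch.randint(0, VOCAB, (bs, seq))
    labels = ids.clone()
    mask = torch.rand(bs, seq) < 0.15          # mask 15% of tokens
    ids[mask] = MASK_ID
    labels[~mask] = -100                       # loss only on masked positions
    return ids, labels

start = time.time()
while time.time() - start < BUDGET_SECONDS:    # hard wall-clock budget
    ids, labels = random_batch()
    logits = head(model(embed(ids)))
    loss = loss_fn(logits.view(-1, VOCAB), labels.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```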
This short paper discusses continually updated causal abstractions as a potential direction of future research. The key idea is to revise the existing level of causal abstraction to a different level of detail that is both consistent with the history of observed data and more effective in solving a given task.
State-of-the-art poetry generation systems are often complex. They either consist of task-specific model pipelines, incorporate prior knowledge in the form of manually created constraints, or both. In contrast, end-to-end models would not suffer from the overhead of having to model prior knowledge and could learn the nuances of poetry from data alone, reducing the degree of human supervision required. In this work, we investigate end-to-end poetry generation conditioned on styles such as rhyme, meter, and alliteration. We identify and address a lack of training data and mismatched tokenization algorithms as possible limitations of past attempts. In particular, we successfully pre-train and release ByGPT5, a new token-free decoder-only language model, and fine-tune it on a large custom corpus of English and German quatrains annotated with our styles. We show that ByGPT5 outperforms other models such as mT5, ByT5, GPT-2 and ChatGPT, while also being more parameter efficient and performing favorably compared to humans. In addition, we analyze its runtime performance and introspect the model's understanding of style conditions. We make our code, models, and datasets publicly available.
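A sketch of what token-free, style-conditioned input encoding can look like: style tags prepended to the text and every character mapped to raw UTF-8 bytes, so no learned tokenizer (and no tokenization mismatch) is involved. The tag format and style names are illustrative assumptions, not ByGPT5's actual conditioning scheme.

```python
# Hypothetical style annotations for a quatrain.
STYLES = {"rhyme": "ABAB", "meter": "iambic", "alliteration": "high"}

def encode(quatrain: str, styles: dict) -> list[int]:
    """Prepend style tags and map the result to raw UTF-8 bytes -- the
    vocabulary is just the 256 byte values, with no learned tokenizer."""
    tagged = "".join(f"<{k}={v}>" for k, v in styles.items()) + quatrain
    return list(tagged.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("Shall I compare thee to a summer's day?", STYLES)
print(len(ids), decode(ids)[:40])
```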
Overfitting is a problem in Convolutional Neural Networks (CNN) that causes poor generalization of models on unseen data. To remedy this problem, many new and diverse data augmentation (DA) methods have been proposed to supplement or generate more training data and thereby increase its quality. In this work, we propose a new data augmentation algorithm: VoronoiPatches (VP). We primarily utilize non-linear recombination of information within an image, fragmenting and occluding small information patches. Unlike other DA methods, VP uses small convex polygon-shaped patches in a random layout to transport information around within an image. Sudden transitions created between patches and the original image can, optionally, be smoothed. In our experiments, VP outperformed current DA methods regarding model variance and overfitting tendencies. We demonstrate that data augmentation utilizing non-linear recombination of information within images, together with non-orthogonal shapes and structures, improves CNN model robustness on unseen data.
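A loose sketch of a VoronoiPatches-style augmentation: partition the image into Voronoi cells (each pixel joins its nearest seed) and transport a few randomly chosen cells to new positions within the image. Seed count, number of moved cells, and offsets are illustrative parameters, and the optional transition smoothing is omitted.

```python
import numpy as np

def voronoi_patches(img, n_seeds=20, n_moved=5, seed=0):
    """Copy a few randomly chosen Voronoi cells of the image to randomly
    shifted positions, recombining information within the image."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    seeds = rng.integers(0, [h, w], size=(n_seeds, 2))
    # Voronoi assignment: label each pixel with the index of its nearest seed.
    d2 = (ys[..., None] - seeds[:, 0]) ** 2 + (xs[..., None] - seeds[:, 1]) ** 2
    cells = d2.argmin(axis=-1)
    out = img.copy()
    for c in rng.choice(n_seeds, size=n_moved, replace=False):
        py, px = np.where(cells == c)
        if len(py) == 0:
            continue
        dy = rng.integers(-h // 4, h // 4)     # random offset for the patch
        dx = rng.integers(-w // 4, w // 4)
        qy = np.clip(py + dy, 0, h - 1)
        qx = np.clip(px + dx, 0, w - 1)
        out[qy, qx] = img[py, px]              # transport the cell's pixels
    return out

augmented = voronoi_patches(np.random.rand(64, 64))
```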